With the recent emergence of new online betting apps and websites, there has been growing interest in the sports betting world. I myself have used some of these sports betting outlets, legally of course, and gained an interest in using statistics in combination with betting odds to potentially predict the winner of an NFL game.
Therefore, the main goal of this project is to build a binary classification machine learning model that predicts whether the home or visiting team will win an NFL game.
Because of my personal interest and involvement in sports betting, I combined historical data and statistics from multiple websites to create a data set with enough predictors and observations for a machine learning project of this style.
This dataset, titled “nfl_szn_2018_to_2019.csv”, is a csv file that includes all of the NFL regular season and playoff games played in the 2018 - 2019 NFL season.
In terms of obtaining the data, I was able to get the historical betting odds, final score, and each game with respective date from: https://www.sportsbookreviewsonline.com/scoresoddsarchives/nfl-odds-2018-19/.
For the rest of the historical data involving previous records and win percentages I used https://www.pro-football-reference.com/years/2018/ and added the columns to the data set using mutate in R.
For running totals I used R to group the data and sum cumulatively, again with the mutate command.
For injuries I went week by week and manually inputted each team’s number of injured players from: https://www.nfl.com/injuries/league/2018/REG1.
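To make the running-total construction concrete, here is a minimal sketch of how such cumulative statistics can be computed with mutate. The table and its team and win columns are a hypothetical stand-in for illustration, not the actual build script:

```r
library(dplyr)

# toy game log; "team" and "win" are hypothetical column names
game_log <- data.frame(
  team = c("A", "A", "A", "B", "B"),
  win  = c(1, 0, 1, 0, 0)
)

running <- game_log %>%
  group_by(team) %>%
  mutate(
    games_played = row_number() - 1,               # games before the current one
    wins_before  = lag(cumsum(win), default = 0),  # wins before the current game
    szn_win_per  = ifelse(games_played == 0, 0.5,  # 0.5 placeholder before week 1
                          wins_before / games_played)
  ) %>%
  ungroup()
```

This mirrors how a column like szn_win_per uses only the games played before the game at hand.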
Let’s create a roadmap for how we will strive to reach our goal of predicting NFL games now that we have an initial grasp on the data we will be working with. This will be a multi-step process.
This process begins with cleaning and organizing our data by making the necessary adjustments. Specifically, we will load the data set, check for missing data, and tidy the data set.
Then, we will examine how the data set and its predictor variables are constructed and related by way of its structure, summary, definition of variables, and visual EDA emphasizing plots, graphs, and correlation matrices.
Next, we will split the data into training and testing sets, make a recipe, and set folds for cross-validation.
Logistic Regression, K-Nearest Neighbors, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Elastic Net (Lasso), Decision Tree, Random Forest, Boosted Trees, and Support Vector Machine models will all be built and trained on the training set we split. We will determine the best version of each model, analyze the results for all models, rank them by the performance metrics of accuracy and roc_auc, and produce plots for each model. Lastly, we will fit the testing data set to the model with the best roc_auc value and to the model with the best accuracy value, respectively. Hopefully one or both models will give us positive results and accurately predict the NFL games in the testing data set.
Let’s begin this machine learning project by introducing our data, making any needed changes, and examining the different aspects of how our data is compiled in this specific data set. It is crucial that we understand each aspect of our data set in order to ensure we can interpret every part of our project.
Before we can even begin our project and build anything we must first load in our data and make the necessary adjustments and checks to verify that our data is not only applicable to the project but is also organized and labelled in a way that makes the data easy to manipulate and analyze once we have completed more detailed and thorough steps.
To start, we will load in our data set without the counter column.
# to read in our csv data
# also drops unneeded counter column
nfl_bet <- suppressMessages(read_csv("nfl_szn_2018_to_2019.csv",
col_types = cols(...1 = col_skip())))
Although I created this data set, it is always necessary to check for missing data. Missing data could create problems further down the line of our analysis, model building, and predicting.
# plot of missing values in the data
nfl_bet %>%
vis_dat()
From the visual representation of our data there does not seem to be any missing data. Just to be safe we will check all observations as well.
sum(is.na(nfl_bet))
## [1] 0
Perfect, we have zero observations with missing data. Let’s move on to tidying and manipulating our data set.
Once again, even though I created this data set there are still variables that need to be removed and certain tidying of the data that needs to take place before we can progress with our project.
First, because we are only concerned with NFL games that finished with a winner, we will remove any NFL games that ended in a tie.
# creates a subset of the data that excludes any games that ended in a tie
nfl_bet <- subset(nfl_bet,Winner != "Tie")
Second, we will adjust the divisions so that the division column can be converted to a factor type.
# strips the conference prefix so each division becomes a region of the United States
nfl_bet <- nfl_bet %>%
  mutate(
    Division_V = gsub("^(AFC|NFC)", "", Division_V),
    Division_H = gsub("^(AFC|NFC)", "", Division_H)
  )
Third, we will remove the Date and GameID columns because they are not needed as predictors, nor do they help us understand and analyze the data set for our purposes.
# removes the Date and GameID columns
nfl_bet <- nfl_bet %>%
  select(-Date, -GameID)
Fourth, we will clean the variable names by using
clean_names() so that all of our data columns are uniform
and easier to work with.
# cleans up the variable names
nfl_bet <- nfl_bet %>%
as_tibble() %>%
clean_names()
Fifth, we will convert all of our character-type variables, except the NFL team names, into factor type so that they can be used as predictors.
# converts all character predictors (except team names) to factors
nfl_bet <- nfl_bet %>%
  mutate(across(c(winner, same_div, same_conf, off_bye_wk_v, off_bye_wk_h,
                  division_v, division_h, conference_v, conference_h),
                as.factor))
Now our data has been adjusted correctly for the project and we will
save it as an rda file using the R command save() so that
it can be loaded into different R files for the model building process
of this project.
save(nfl_bet,file="nfl_bet.rda")
Since our data has been adjusted correctly, we will examine the attributes, structure, and summary of our data.
Below is a summary of our data:
# looking at summary of data set
nfl_bet %>%
summary()
## week same_div same_conf team_v team_h
## Min. : 1.000 N:169 N: 64 Length:264 Length:264
## 1st Qu.: 5.000 Y: 95 Y:200 Class :character Class :character
## Median : 9.500 Mode :character Mode :character
## Mean : 9.455
## 3rd Qu.:14.000
## Max. :20.000
## final_v final_h ml_v ml_h
## Min. : 0.00 Min. : 0.00 Min. :-1200.00 Min. :-2000.0
## 1st Qu.:16.00 1st Qu.:17.00 1st Qu.: -142.75 1st Qu.: -300.0
## Median :22.00 Median :24.00 Median : 145.00 Median : -170.0
## Mean :22.22 Mean :24.42 Mean : 80.31 Mean : -161.4
## 3rd Qu.:28.00 3rd Qu.:31.00 3rd Qu.: 240.00 3rd Qu.: 122.8
## Max. :51.00 Max. :54.00 Max. : 1000.00 Max. : 750.0
## injuries_v injuries_h division_v division_h conference_v
## Min. : 0.000 Min. : 2.000 East :68 East :66 AFC:132
## 1st Qu.: 7.000 1st Qu.: 7.000 North:62 North:64 NFC:132
## Median : 9.000 Median : 9.000 South:66 South:67
## Mean : 9.568 Mean : 9.818 West :68 West :67
## 3rd Qu.:12.000 3rd Qu.:13.000
## Max. :21.000 Max. :22.000
## conference_h szn_win_per_v szn_win_per_h ud_tot_v
## AFC:132 Min. :0.0000 Min. :0.0000 Min. : 0.000
## NFC:132 1st Qu.:0.3570 1st Qu.:0.3330 1st Qu.: 1.000
## Median :0.5000 Median :0.5000 Median : 3.000
## Mean :0.5047 Mean :0.4947 Mean : 3.686
## 3rd Qu.:0.6670 3rd Qu.:0.6670 3rd Qu.: 6.000
## Max. :1.0000 Max. :1.0000 Max. :13.000
## ud_tot_h fv_tot_v fv_tot_h ud_win_per_v
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:0.200
## Median : 3.000 Median : 3.000 Median : 3.000 Median :0.400
## Mean : 3.761 Mean : 4.027 Mean : 3.989 Mean :0.414
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:0.500
## Max. :13.000 Max. :17.000 Max. :16.000 Max. :1.000
## ud_win_per_h fv_win_per_v fv_win_per_h prev_win_rec_v
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.200 1st Qu.:0.5000 1st Qu.:0.5000 1st Qu.: 1.750
## Median :0.500 Median :0.5584 Median :0.5000 Median : 3.000
## Mean :0.419 Mean :0.6081 Mean :0.5914 Mean : 4.027
## 3rd Qu.:0.500 3rd Qu.:0.8000 3rd Qu.:0.8000 3rd Qu.: 6.000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :14.000
## prev_win_rec_h prev_loss_rec_v prev_loss_rec_h hc_win_per_v
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.1132
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.:0.4243
## Median : 3.000 Median : 3.000 Median : 3.000 Median :0.5184
## Mean : 4.011 Mean : 3.795 Mean : 3.837 Mean :0.5175
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:0.5860
## Max. :14.000 Max. :12.000 Max. :11.000 Max. :3.0377
## hc_win_per_h hc_g_v hc_g_h prev_szn_win_per_v
## Min. :0.1048 Min. : 13.00 Min. : 12.00 Min. :0.0000
## 1st Qu.:0.4246 1st Qu.: 66.75 1st Qu.: 68.75 1st Qu.:0.3595
## Median :0.5208 Median : 92.50 Median : 94.50 Median :0.5630
## Mean :0.5199 Mean :135.18 Mean :141.17 Mean :0.5013
## 3rd Qu.:0.5887 3rd Qu.:220.50 3rd Qu.:233.25 3rd Qu.:0.6250
## Max. :3.0769 Max. :455.00 Max. :459.00 Max. :0.8130
## prev_szn_win_per_h off_bye_wk_v off_bye_wk_h winner
## Min. :0.0000 N:242 N:254 H:158
## 1st Qu.:0.3750 Y: 22 Y: 10 V:106
## Median :0.5630
## Mean :0.5056
## 3rd Qu.:0.6250
## Max. :0.8130
Below is the structure of our data:
# looking at the structure of the data set
nfl_bet %>%
str()
## tibble [264 × 38] (S3: tbl_df/tbl/data.frame)
## $ week : num [1:264] 1 1 1 1 1 1 1 1 1 1 ...
## $ same_div : Factor w/ 2 levels "N","Y": 1 1 1 1 1 2 1 1 2 1 ...
## $ same_conf : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 2 2 2 1 ...
## $ team_v : chr [1:264] "Atlanta" "SanFrancisco" "Cincinnati" "Buffalo" ...
## $ team_h : chr [1:264] "Philadelphia" "Minnesota" "Indianapolis" "Baltimore" ...
## $ final_v : num [1:264] 12 16 34 3 20 48 20 20 38 24 ...
## $ final_h : num [1:264] 18 24 23 47 15 40 27 27 28 27 ...
## $ ml_v : num [1:264] -119 230 -115 290 -145 400 225 -110 160 160 ...
## $ ml_h : num [1:264] -101 -280 -105 -380 125 -550 -275 -110 -190 -190 ...
## $ injuries_v : num [1:264] 8 5 7 3 5 7 8 5 5 11 ...
## $ injuries_h : num [1:264] 6 8 7 4 3 2 6 7 3 8 ...
## $ division_v : Factor w/ 4 levels "East","North",..: 3 4 2 1 3 3 3 3 4 4 ...
## $ division_h : Factor w/ 4 levels "East","North",..: 1 2 3 2 1 3 1 1 4 4 ...
## $ conference_v : Factor w/ 2 levels "AFC","NFC": 2 2 1 1 1 2 1 1 1 2 ...
## $ conference_h : Factor w/ 2 levels "AFC","NFC": 2 2 1 1 2 2 1 1 1 1 ...
## $ szn_win_per_v : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ szn_win_per_h : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ ud_tot_v : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ ud_tot_h : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ fv_tot_v : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ fv_tot_h : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ ud_win_per_v : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ ud_win_per_h : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ fv_win_per_v : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ fv_win_per_h : num [1:264] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ prev_win_rec_v : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ prev_win_rec_h : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ prev_loss_rec_v : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ prev_loss_rec_h : num [1:264] 0 0 0 0 0 0 0 0 0 0 ...
## $ hc_win_per_v : num [1:264] 0.522 0.589 0.521 0.685 0.402 0.438 0.488 0.575 0.638 0.589 ...
## $ hc_win_per_h : num [1:264] 0.539 0.566 0.477 0.611 0.286 0.654 0.609 0.391 0.429 0.313 ...
## $ hc_g_v : num [1:264] 69 265 240 89 82 32 84 73 378 90 ...
## $ hc_g_h : num [1:264] 89 113 65 234 49 442 233 64 49 16 ...
## $ prev_szn_win_per_v: num [1:264] 0.625 0.375 0.438 0.563 0.625 0.313 0.25 0.563 0.625 0.563 ...
## $ prev_szn_win_per_h: num [1:264] 0.813 0.813 0.25 0.563 0.188 0.688 0.813 0.375 0.563 0.313 ...
## $ off_bye_wk_v : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ off_bye_wk_h : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ winner : Factor w/ 2 levels "H","V": 1 1 2 1 2 2 1 1 2 1 ...
Below are the dimensions of our data:
# dimensions of data
nfl_bet %>% dim()
## [1] 264 38
Thus, we can observe that our data set contains 264 observations with 38 columns.
Finally, we will take a look at the first 10 rows of our data to visualize what our data looks like from a table perspective.
head(nfl_bet)
## # A tibble: 6 × 38
## week same_div same_conf team_v team_h final_v final_h ml_v ml_h injuries_v
## <dbl> <fct> <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 N Y Atlanta Phila… 12 18 -119 -101 8
## 2 1 N Y SanFra… Minne… 16 24 230 -280 5
## 3 1 N Y Cincin… India… 34 23 -115 -105 7
## 4 1 N Y Buffalo Balti… 3 47 290 -380 3
## 5 1 N N Jackso… NYGia… 20 15 -145 125 5
## 6 1 Y Y TampaB… NewOr… 48 40 400 -550 7
## # ℹ 28 more variables: injuries_h <dbl>, division_v <fct>, division_h <fct>,
## # conference_v <fct>, conference_h <fct>, szn_win_per_v <dbl>,
## # szn_win_per_h <dbl>, ud_tot_v <dbl>, ud_tot_h <dbl>, fv_tot_v <dbl>,
## # fv_tot_h <dbl>, ud_win_per_v <dbl>, ud_win_per_h <dbl>, fv_win_per_v <dbl>,
## # fv_win_per_h <dbl>, prev_win_rec_v <dbl>, prev_win_rec_h <dbl>,
## # prev_loss_rec_v <dbl>, prev_loss_rec_h <dbl>, hc_win_per_v <dbl>,
## # hc_win_per_h <dbl>, hc_g_v <dbl>, hc_g_h <dbl>, prev_szn_win_per_v <dbl>, …
Let’s define and explain each variable in our data set. Note that we have 1 response variable, 4 non-predictor variables, and 33 predictor variables.
Definition of our Response Variable:
- winner: The winner of each NFL game (either H for the home team or V for the visiting team)
nfl_bet$winner %>%
summary()
## H V
## 158 106
We can see that the home team won 52 more games in the season than the visiting team. This will be further analyzed in the visual EDA of the response variable distribution.
Looking at our dimensions, we must note that not all columns will be used as predictors in our recipe. The variables that will not be used as predictors, but are necessary to understanding the data as a whole, are:
- final_v: The final score of the visiting team for each game. This will not be used as a predictor because it would directly tell our model the outcome of each game.
- final_h: The final score of the home team for each game. This will not be used as a predictor because it would directly tell our model the outcome of each game.
- team_v: The visiting team of each NFL game. This will not be used as a predictor because we want to reduce the cardinality of our data set.
- team_h: The home team of each NFL game. This will not be used as a predictor because we want to reduce the cardinality of our data set.
After denoting our response variable, the variables that will not be predictors, and removing unnecessary variables, we are left with 33 variables. Thus, we will have 33 predictor variables.

Predictor Variables:
- week: NFL Week Number - The week number in which each NFL game is played. There are 17 NFL weeks in the regular season along with the playoffs, which are 3 weeks long, resulting in 20 total weeks.
- same_div: Same NFL Division - This states whether the two teams playing each game are in the same NFL division. There are 8 divisions in total. A value of N is returned if they are not in the same division and a value of Y is returned if they are.
- same_conf: Same NFL Conference - This states whether the two teams playing each game are in the same NFL conference. There are 2 conferences in total. A value of N is returned if they are not in the same conference and a value of Y is returned if they are.
- ml_v: Visitor Moneyline - The visiting NFL team’s odds to win each game. A moneyline value < 0 means they have the better odds of winning the game. A moneyline value > 0 means they have the better odds of losing the game.
- ml_h: Home Moneyline - The home NFL team’s odds to win each game. A moneyline value < 0 means they have the better odds of winning the game, i.e. the favorite. A moneyline value > 0 means they have the better odds of losing the game, i.e. the underdog.
- injuries_v: Visitor Injuries - The total number of injured players reported before each game for the visiting NFL team.
- injuries_h: Home Injuries - The total number of injured players reported before each game for the home NFL team.
- division_v: Visitor Division - The division of which the visiting NFL team is a member. There are 8 unique divisions, NFC North, South, East, & West as well as AFC North, South, East, & West. However, for convenience, these divisions will just be referred to as North, South, East, and West.
- division_h: Home Division - The division of which the home NFL team is a member, referred to as North, South, East, or West as above.
- conference_v: Visitor Conference - The conference of which the visiting NFL team is a member. There are 2 unique conferences: AFC & NFC.
- conference_h: Home Conference - The conference of which the home NFL team is a member: AFC & NFC.
- szn_win_per_v: Visitor Season Winning Percentage - A running total that states the visiting NFL team’s winning percentage for the entire NFL season leading up to each game: (Total Games Won)/(Total Games Played).
- szn_win_per_h: Home Season Winning Percentage - The same running winning percentage for the home NFL team.
- ud_tot_v: Visitor Underdog Total - A running total, leading up to each game, of how many previous games this season the visiting NFL team entered with the better odds of losing.
- ud_tot_h: Home Underdog Total - The same running underdog count for the home NFL team.
- fv_tot_v: Visitor Favorite Total - A running total, leading up to each game, of how many previous games this season the visiting NFL team entered with the better odds of winning.
- fv_tot_h: Home Favorite Total - The same running favorite count for the home NFL team.
- ud_win_per_v: Visitor Underdog Winning Percentage - A running percentage, leading up to each game, of how often the visiting NFL team won when it was the underdog: (Games Won as Underdog)/(Games Played as Underdog).
- ud_win_per_h: Home Underdog Winning Percentage - The same running underdog winning percentage for the home NFL team.
- fv_win_per_v: Visitor Favorite Winning Percentage - A running percentage, leading up to each game, of how often the visiting NFL team won when it was the favorite: (Games Won as Favorite)/(Games Played as Favorite).
- fv_win_per_h: Home Favorite Winning Percentage - The same running favorite winning percentage for the home NFL team.
- prev_win_rec_v: Visitor Previous Win Record Before Game - The running total of NFL games the visiting NFL team won in all previous games of the season.
- prev_win_rec_h: Home Previous Win Record Before Game - The running total of NFL games the home NFL team won in all previous games of the season.
- prev_loss_rec_v: Visitor Previous Season Loss Record Before Game - The running total of NFL games the visiting NFL team lost in all previous games of the season.
- prev_loss_rec_h: Home Previous Season Loss Record Before Game - The running total of NFL games the home NFL team lost in all previous games of the season.
- hc_win_per_v: Visitor Head Coach Winning Percentage - The running career win percentage of the visiting NFL team’s head coach leading up to each game: (Games Won in Career as an NFL Head Coach)/(Games Coached in Career as an NFL Head Coach).
- hc_win_per_h: Home Head Coach Winning Percentage - The same running career win percentage for the home NFL team’s head coach.
- hc_g_v: Visitor Head Coach Games Coached - The running career total of games coached as an NFL head coach by the visiting NFL team’s head coach leading up to each game.
- hc_g_h: Home Head Coach Games Coached - The same running career total for the home NFL team’s head coach.
- prev_szn_win_per_v: Visitor Previous Season Win Percentage - The overall winning percentage of the visiting NFL team in the previous season, 2017-2018.
- prev_szn_win_per_h: Home Previous Season Win Percentage - The overall winning percentage of the home NFL team in the previous season, 2017-2018.
- off_bye_wk_v: Visitor Playing Off a Bye Week - This states whether the visiting NFL team had a bye week (i.e. did not play) the week before the game: N if the visiting team did play an NFL game the previous week, Y if it did not.
- off_bye_wk_h: Home Playing Off a Bye Week - This states whether the home NFL team had a bye week (i.e. did not play) the week before the game: N if the home team did play an NFL game the previous week, Y if it did not.
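Since the moneyline variables drive much of the analysis, it may help to note that American moneyline odds map to an implied win probability via a standard conversion. This helper is illustrative and not part of the data set:

```r
# converts an American moneyline to its implied win probability
ml_to_prob <- function(ml) {
  ifelse(ml < 0,
         -ml / (-ml + 100),  # favorite:  e.g. -150 -> 150/250  = 0.60
         100 / (ml + 100))   # underdog:  e.g. +200 -> 100/300 ~= 0.33
}

ml_to_prob(c(-150, 200))
```

Note that the implied probabilities of the two sides of a game sum to slightly more than 1 because of the sportsbook’s margin.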
win_bar <- ggplot(nfl_bet, aes(x=winner)) +
geom_bar(color="black",
fill = "cyan") +
labs(x="Game Winner",y="# of Games Won",title="Distribution of the Number of Games Won by Home and Visiting Teams")
win_bar
As we can see, this confirms our summary of the response variable: the home team won 52 more games than the visiting team. About 59.8% of games were won by the home team while only about 40.2% were won by the visiting team. This suggests a significant ‘home field advantage’ in this NFL season, which could be a crucial factor to consider in our predictive modeling and analysis.
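As a quick sanity check of that home field advantage, which is not part of the original workflow, we can compute the home-win proportion and run an exact binomial test against a 50/50 null:

```r
home_wins <- 158
games     <- 264

home_wins / games   # about 0.598

# exact binomial test: is the home-win rate plausibly 50%?
binom.test(home_wins, games, p = 0.5)
```

The small p-value suggests the home-team edge in this season is unlikely to be chance alone, though one season is a limited sample.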
To understand the relationships among our numerical variables, we will create a correlation matrix. Following that, we’ll visualize these correlations using a heatmap, providing a clear graphical representation of how these predictors are related to each other.
nfl_bet %>%
select(-final_v,-final_h) %>%
select(where(is.numeric)) %>% # selecting numeric columns
cor() %>%
corrplot(method="color",type = "lower", diag = FALSE)
That may be a little too difficult to analyze as a heatmap, so let’s plot the correlations with numbers instead to get a better numerical understanding of the correlations in our data set.
nfl_bet %>%
select(-final_v,-final_h) %>%
select(where(is.numeric)) %>% # selecting numeric columns
cor() %>%
corrplot(method="number",type = "lower", diag = FALSE)
Now we have a better plot to look at. It looks as though our strongest correlation is a negative correlation of -0.95 between our ml_h and ml_v predictors. This makes sense, as there are very few instances where both teams have the same odds to win a game against each other. Thus, if either the home or visiting team has the better odds of winning the game, the opposing team would, in turn, have the worse odds of winning, resulting in a very strong negative correlation. In terms of our next strongest correlations, we note that:
- prev_win_rec_h and fv_tot_h
- prev_win_rec_v and fv_tot_v
are tied for the second strongest correlation, with a positive correlation of 0.87. This makes sense: if either the home or visiting team has an above-average number of wins for the season, it would also have a high number of games designated as the favorite, because in a majority of instances the team with more wins will be favored to win. To me, the most interesting part of this correlation plot is that the number of injuries, whether for the home or visiting team, does not seem to have a strong correlation with any other variable. I had always thought that more injuries going into a game would result in either worse odds to win the game or a worse overall record.
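Reading the strongest pairs off a dense numeric plot is error-prone, so as a sketch, here is a small helper (not in the original analysis) that lists pairwise correlations from strongest to weakest:

```r
library(dplyr)
library(tidyr)

# helper: long table of pairwise correlations, strongest first
top_cors <- function(df) {
  df %>%
    select(where(is.numeric)) %>%
    cor() %>%
    as.data.frame() %>%
    tibble::rownames_to_column("var1") %>%
    pivot_longer(-var1, names_to = "var2", values_to = "corr") %>%
    filter(var1 < var2) %>%   # keep each unordered pair once
    arrange(desc(abs(corr)))
}

# for the betting data this would be:
# top_cors(select(nfl_bet, -final_v, -final_h))
```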
Now we will create a few visualizations using the moneyline predictor to see how these odds influence the winner of each game.
First, we have a barplot of games won when the home team is the favorite, or favored to win.
HV_ml_h_bar <- ggplot(nfl_bet, aes(x=winner,fill=ml_h<0)) +
geom_bar() +
ggtitle("Games Won When Home Team is Favored to Win") +
labs(fill = "Favorite = Home Team",x= "Game Winner",y= "# of Games Won")
HV_ml_h_bar
Next, we have a barplot of games won when the visiting team is the favorite, or favored to win.
HV_ml_v_bar <- ggplot(nfl_bet, aes(x=winner,fill=ml_v<0)) +
geom_bar() +
ggtitle("Games Won When Visiting Team is Favored to Win") +
labs(fill = "Favorite = Visiting Team",x= "Game Winner",y= "# of Games Won")
HV_ml_v_bar
We can also examine a boxplot of the home team’s moneyline odds, grouped by game winner.
box <- ggplot(nfl_bet,aes(x=ml_h,y=winner, group=winner)) +
geom_boxplot(aes(fill = winner))+
theme_minimal()+
geom_jitter(alpha=0.5) +
theme(legend.position = "top") +
labs(x="Home Moneyline Odds")
box
Here, we have a barplot of game winners in every week of the NFL season.
week_bar <- ggplot(nfl_bet, aes(x=factor(week,levels=1:20),fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners per Week") +
labs(x= "Week #",y= "# of Games Won")
week_bar
Below is a barplot of game winners against the number of games each team had already won before the game.
nfl_prev_win <- nfl_bet %>%
pivot_longer(cols = c(prev_win_rec_h, prev_win_rec_v), names_to = "prev_win_type", values_to = "prev_win")
prev_win_bar <- ggplot(nfl_prev_win, aes(x=factor(prev_win,levels=0:14),fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. # of Previous Team Wins Before Game") +
labs(x= "# of Previous Team Wins Before Game",y= "# of Games Won")
prev_win_bar
Now, with this barplot we can see the distribution of wins in each specific NFL division.
nfl_div <- nfl_bet %>%
pivot_longer(cols = c(division_h, division_v), names_to = "division_type", values_to = "division")
div_bar <- ggplot(nfl_div, aes(x=division,fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. Division") +
labs(x= "Division",y= "# of Games Won")
div_bar
Similar to the previous barplot, this is a distribution of games won in each specific NFL conference.
nfl_cnf <- nfl_bet %>%
pivot_longer(cols = c(conference_h, conference_v), names_to = "conference_type", values_to = "conf")
conf_bar <- ggplot(nfl_cnf, aes(x=conf,fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. Conference") +
labs(x= "Conference",y= "# of Games Won")
conf_bar
Next is a barplot of how many games teams won dependent on how many injuries they reported before the game.
nfl_inj <- nfl_bet %>%
pivot_longer(cols = c(injuries_h, injuries_v), names_to = "inj_type", values_to = "inj")
inj_bar <- ggplot(nfl_inj, aes(x=factor(inj,levels=0:30),fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. # of Injuries") +
labs(x= "# of Injuries",y= "# of Games Won")
inj_bar
Here, we have the number of games won dependent on the number of games they had already played as the underdog.
nfl_ud <- nfl_bet %>%
pivot_longer(cols = c(ud_tot_h, ud_tot_v), names_to = "ud_type", values_to = "ud")
ud_bar <- ggplot(nfl_ud, aes(x=factor(ud,levels=0:30),fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. # of Games as Underdog") +
labs(x= "# of Games as Underdog",y= "# of Games Won")
ud_bar
Finally, we have the number of games won dependent on the number of games they had already played as the favorite.
nfl_fv <- nfl_bet %>%
pivot_longer(cols = c(fv_tot_h, fv_tot_v), names_to = "fv_type", values_to = "fv")
fv_bar <- ggplot(nfl_fv, aes(x=factor(fv,levels=0:30),fill=winner)) +
geom_bar(position=position_dodge()) +
ggtitle("Game Winners vs. # of Games as Favorite") +
labs(x= "# of Games as Favorite",y= "# of Games Won")
fv_bar
Having gained insights into some key variables that influence whether the home or visiting team wins an NFL game, we can now progress to the model-building portion of our project.
Prior to building our models, we must first split our data into training and testing subsets of the original data set in order to accurately choose the model that fits and predicts our data best. Note that before we split, we must set our seed so that our random split can be reproduced every time we train and test our models.
# setting seed for ability to reproduce
set.seed(2101)
# splitting the data at 70/30
nfl_split <- initial_split(nfl_bet, prop = 0.70, strata = winner)
# training subset (70)
nfl_train <- training(nfl_split)
# testing subset (30)
nfl_test <- testing(nfl_split)
Below are the dimensions of our NFL training data set:
nfl_train
dim(nfl_train)
## [1] 184 38
With this training data set, we will train a multitude of different classification models based on the values of the predictors and the response variable, in combination with the recipe we will create in the next section. With the testing data set we also created in the split, we will only test the models that performed best on the training data set. The main purpose of this split is to verify that the model we deem the best fit can perform accurately on data it has not yet seen or been trained on.
Below are the dimensions of our NFL testing data set:
nfl_test
dim(nfl_test)
## [1] 80 38
I chose a proportion of 0.7 because it allows for an ample amount of training data while still retaining enough data for testing. Although there are limited observations in total, this proportion is necessary for an accurate model to be constructed. The training data subset has 184 observations and the testing data subset has 80 observations.
Now, we will combine our predictor variables and response variable into a machine learning essential called a recipe. In relation to our data set at hand, a recipe in machine learning is similar to compiling an NFL roster. An NFL roster is comprised of a multitude of players from all walks of life, but each brings a key football component to the team. To create a successful NFL roster, one must bring in players whose football components complement each other's attributes, and there are many trials and errors an NFL team must go through before it can assemble the best overall roster with the resources available. Similarly, in machine learning, instead of football components we use predictor variables, and instead of an NFL roster we compose a recipe that predicts the outcome of a game. Just like in football, we will have many trials and errors with our recipe, but we will eventually use the resources we have, the training data set observations, to create the most successful recipe possible.
Let's treat our recipe like an NFL team. First we must cut the players on our roster that do not serve our team. In our case we are removing the variables team_h and team_v to reduce cardinality. We are also removing the variables final_h and final_v, because keeping them in our recipe would be detrimental to model training: the final scores are obvious identifiers of which team won the game. Note that because these variables are removed when the recipe is created, the recipe will also ignore them when the testing data is run through it at the end of the process.
# building our recipe
recipe_nfl <- recipe(winner~.,data=nfl_train) %>%
step_rm(final_h,final_v,team_v,team_h) %>%
step_dummy(same_div) %>% # dummy predictor on categorical variables
step_dummy(same_conf) %>%
step_dummy(division_h) %>%
step_dummy(division_v) %>%
step_dummy(conference_h) %>%
step_dummy(conference_v) %>%
step_dummy(off_bye_wk_h) %>%
step_dummy(off_bye_wk_v) %>%
step_center(all_predictors()) %>% # standardizing our predictors
step_scale(all_predictors())
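As a quick sanity check, we can run the recipe once outside of a workflow. The sketch below (assuming the objects created above are in the session) uses prep() to estimate the centering and scaling statistics from the training data and bake() to apply them, confirming that the removed columns are indeed gone:

```r
# prep() estimates the recipe's statistics from the training data;
# bake() applies them and returns the processed data
nfl_prepped <- prep(recipe_nfl, training = nfl_train)
nfl_baked   <- bake(nfl_prepped, new_data = NULL)

# the removed identifiers should no longer be present
stopifnot(!any(c("team_h", "team_v", "final_h", "final_v") %in% names(nfl_baked)))

# every remaining numeric predictor should now have mean ~ 0 and sd ~ 1
colMeans(nfl_baked[sapply(nfl_baked, is.numeric)])
```

This is purely diagnostic; in the actual model files the recipe is handed to a workflow, which preps and bakes it internally.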
Next, we will use stratified sampling for cross-validation on our
response variable winner because a more accurate and
precise representation of the sample is produced when the samples are
selected in the same proportion as the population. We will use 10
folds to perform stratified cross-validation.
nfl_fold <- vfold_cv(nfl_train, v=10, strata = "winner")
In this portion of the project we will take a peek at the roadmap for model construction: which models we built using this roadmap, how we were able to efficiently create and produce each model, the performance metrics we use to rank each model, details about each model, the best tune we were able to compute for each model, and some tuning plots and graphs from specific models.
In the next section we will state each of the 12 models that will be built, fit, and analyzed; however, we must first explain how these models were constructed.
As a quick side note: although our training data set only consists of
184 observations, because we have 33 predictor variables and are fitting
12 different models, each individual model was created, fit, and tuned in
a separate R file in order to maximize efficiency and organization. Each of
these R files used the function save() to save its model in
an RDA file. We will use the function load() to bring
these models into this specific R file. Note that all of the separate R
files can be found in the data folder attached to this project.
To analyze each of our models, we must first load in the RDA files saved by the separate R files in which these models were created and trained. We load them in below:
load("nfl_log_reg.rda")
load("1nfl_knn.rda")
load("nfl_lda.rda")
load("nfl_qda.rda")
load("1nfl_en.rda")
load("nfl_rf_5.rda")
load("nfl_rf_6.rda")
load("nfl_bt_5.rda")
load("nfl_bt_6.rda")
load("1nfl_dt.rda")
load("nfl_svm_lin.rda")
load("nfl_svm_rbf.rda")
The actual construction of the models is fairly simple. Below is a
summarized roadmap of the process followed for each model:
1. The data set is loaded into the respective R file for the model using the command load("nfl_bet.rda").
2. The seed is set with set.seed(2101) for reproducibility.
3. The data is split into a training set nfl_train and testing set nfl_test with prop = 0.7 and strata = winner.
4. The recipe recipe_nfl is created exactly as it was in the Building Recipe section of this project.
5. Using vfold_cv, the training set is split into 10 folds for stratified cross-validation.
6. The model-specific engine is added, with set_mode("classification") for every model.
7. A workflow is created with the model-specific engine and the recipe recipe_nfl. (For Logistic Regression, LDA, and QDA we skip steps 8 - 10 and proceed from step 11.)
8. The tune grid with model-specific parameters and levels is created.
9. The model is tuned and the best tune for the model is found using select_best().
10. Using the best tune, the workflow is finalized.
11. We fit the model with its workflow to our training data set.
12. We save that fitted model (and tune, if applicable) to an RDA file using save() so that the models and tunes can be efficiently accessed.
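The roadmap above can be sketched in code. Below is a condensed sketch of what one of those per-model R files looks like, using the decision tree as the example; the recipe is abbreviated, and the grid range and levels are illustrative assumptions, not the exact values used in the original tuning files:

```r
library(tidymodels)

load("nfl_bet.rda")                                        # 1. load the data
set.seed(2101)                                             # 2. set the seed
nfl_split <- initial_split(nfl_bet, prop = 0.7,            # 3. stratified 70/30 split
                           strata = winner)
nfl_train <- training(nfl_split)
nfl_test  <- testing(nfl_split)
recipe_nfl <- recipe(winner ~ ., data = nfl_train) %>%     # 4. recipe, abbreviated;
  step_rm(final_h, final_v, team_v, team_h)                #    full version in Building Recipe
nfl_fold <- vfold_cv(nfl_train, v = 10, strata = "winner") # 5. ten stratified folds

tree_spec <- decision_tree(cost_complexity = tune()) %>%   # 6. engine + mode
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%                                  # 7. workflow = engine + recipe
  add_model(tree_spec) %>%
  add_recipe(recipe_nfl)

tree_grid <- grid_regular(cost_complexity(range = c(-4, -1)),  # 8. illustrative grid
                          levels = 10)
tune_tree <- tune_grid(tree_wf, resamples = nfl_fold,      # 9. tune and pick the best
                       grid = tree_grid)
best_dt <- select_best(tune_tree, metric = "roc_auc")
tree_final_wf  <- finalize_workflow(tree_wf, best_dt)      # 10. finalize the workflow
tree_final_fit <- fit(tree_final_wf, nfl_train)            # 11. fit to the training data
save(tree_final_fit, tune_tree, file = "1nfl_dt.rda")      # 12. save for later loading
```

The other eleven files follow the same skeleton, swapping in the appropriate model specification, engine, and tuning parameters.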
To rank and see how each of these models stacks up against the others, we will run performance evaluations using the accuracy and roc_auc values of the predictions each model, built with the recipe we created earlier, produces on the training data.
To find the overall best model for predicting the outcome of NFL games, we will identify the model with the highest accuracy value and the model with the highest roc_auc value. Using both of these models gives us a higher chance of finding one that performs well on both the testing data set and other new, unseen data.
While not the best metric overall for choosing which model will perform well on the testing data or new data, analyzing the accuracy values of each model is still informative thanks to their straightforward interpretation. Although this data is not completely balanced (for example, the number of wins by home teams versus the number of wins by visiting teams), pairing accuracy with our next performance metric will give us a clearer picture of each model's capability.
In terms of computation, the accuracy values for
classification models are computed as a proportion of the correctly
predicted instances out of the total instances.
The formula is: \[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]
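As a tiny self-contained illustration of this formula (the truth and prediction vectors below are made up for demonstration):

```r
# toy truth / prediction vectors (illustrative only)
truth     <- c("H", "H", "V", "H", "V", "V", "H", "V")
predicted <- c("H", "V", "V", "H", "V", "H", "H", "V")

# accuracy = number of correct predictions / total number of predictions
accuracy <- mean(truth == predicted)
accuracy  # 6 of 8 correct, so 0.75
```

This is exactly what yardstick's accuracy() computes from the .pred_class column later in this section.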
The second performance metric we will examine is the Receiver
Operating Characteristic Area Under the Curve. More commonly
known by its acronym, ROC AUC, this performance metric is much more
useful for imbalanced data sets. By evaluating each respective model's
ability to distinguish between classes, the roc_auc values
provide a single measure of the quality of each model's performance and
predictions, irrespective of any classification threshold.
In terms of computation, the roc_auc values for
classification models are computed by integrating to find the
area under the curve formed by plotting the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various classification threshold
settings.
On a simplified level the formulas for TPR and FPR are: \[ \text{TPR} = \frac{\text{TP}}{\text{TP+FN}} \] and \[ \text{FPR} = 1-\text{Specificity}=1-\frac{\text{TN}}{\text{TN+FP}}=\frac{\text{FP}}{\text{TN+FP}} \] where \[ \text{Specificity}=\frac{\text{TN}}{\text{TN+FP}}\]
Note that: \[ \text{TN}\] represents the number of True Negatives, \[ \text{TP}\] represents the number of True Positives, \[ \text{FP}\] represents the number of False Positives, and \[ \text{FN}\] represents the number of False Negatives
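As a small worked example (the confusion-matrix counts below are hypothetical, chosen only for illustration), these formulas reduce to a few lines of arithmetic:

```r
# hypothetical confusion-matrix counts (illustrative only)
TP <- 107; FN <- 3; TN <- 71; FP <- 3

TPR         <- TP / (TP + FN)   # true positive rate (sensitivity)
specificity <- TN / (TN + FP)   # true negative rate
FPR         <- 1 - specificity  # equivalently FP / (TN + FP)

round(c(TPR = TPR, FPR = FPR), 4)
```

Sweeping the classification threshold traces out many such (FPR, TPR) pairs; the ROC curve connects them and roc_auc is the area beneath it.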
Due to our project using the classification approach, there are 12 different models that we will fit our data to. These models are:
- Logistic Regression
- K-Nearest Neighbors
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Decision Trees
- Random Forests at 5 Levels
- Random Forests at 6 Levels
- Boosted Trees at 5 Levels
- Boosted Trees at 6 Levels
- Elastic Net - Lasso
- Support Vector Machines - Linear Kernel
- Support Vector Machines - Radial Kernel
Before we produce these models, however, let’s run through the characteristics of each model used in relation to this project.
Once again we note that for each model the mode was set to
"classification".
The Logistic Regression model assumes a linear relationship between the log-odds of the dependent variable and the predictors. It models the probability that the home team wins based on the predictors.
The engine used for this model is: "glm"
The K-Nearest Neighbors (KNN) model uses previous games (neighbors) to classify the game at hand.
The engine used for this model is: "kknn".
The Linear Discriminant Analysis (LDA) model tries to find a linear combination of predictors that best separates whether the home or visiting team won while assuming the home and visiting team share the same covariance matrix.
The engine used for this model is: "MASS".
The Quadratic Discriminant Analysis (QDA) model does the same as LDA but it allows the home and visiting team to each have their own respective covariance matrices.
The engine used for this model is: "MASS".
The Decision Trees model splits the data into branches based on the values of the predictors to create a tree structure of relationships and interactions between predictor variables.
The engine used for this model is: "rpart".
autoplot() of Decision Trees model tune:
autoplot(tune_tree)
best_dt <- select_best(tune_tree, metric = "roc_auc")
best_dt
## # A tibble: 1 × 2
## cost_complexity .config
## <dbl> <chr>
## 1 0.0001 Preprocessor1_Model01
The best Decision Tree model was Model 1 with
cost_complexity = 0.0001000000
Below is an rpart.plot of Model 1
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint=FALSE)
The Random Forests model builds numerous singular decision trees and merges them to get even better predictions while also being less prone to over fitting than singular decision trees.
For this project, due to the number of predictor variables being 33, the random forests model was not able to be tuned with mtry at the full 33 levels. Therefore, we chose to use the square root of 33. However, since this is not a natural number (5.74456264654), the levels of 5 and 6 were both used to cover all bases.
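The square-root heuristic for mtry is one line of arithmetic in R:

```r
p <- 33            # number of predictor variables in our data set
sqrt(p)            # 5.744563..., not a natural number
floor(sqrt(p))     # 5
ceiling(sqrt(p))   # 6, so we tune with upper bounds of both 5 and 6
```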
The engine used for this model at both levels is: "ranger" with importance = "impurity".
autoplot() of Random Forests model tune with 5
Levels
autoplot(tune_rf_5) + theme_minimal()
best_rf_5 <- select_best(tune_rf_5, metric = "roc_auc")
best_rf_5
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 1 175 12 Preprocessor1_Model031
The best random forest model with 5 levels was Model 31 with
mtry = 1, min_n = 12, and trees =
175.
autoplot() of Random Forests model tune with 6
Levels
autoplot(tune_rf_6) + theme_minimal()
best_rf_6 <- select_best(tune_rf_6, metric = "roc_auc")
best_rf_6
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 1 100 10 Preprocessor1_Model001
The best random forest model with 6 levels was Model 1 with
mtry = 1, min_n = 10, and trees =
100.
The Boosted Trees model combines numerous predictors deemed to be average or weak into one strong predictor by correcting the mistakes made in previous trees.
Just like the random forests models, due to the number of predictor variables being 33, the boosted trees model was not able to be tuned at the full 33 levels. Therefore, we chose to use the square root of 33. However, since this is not a natural number (5.74456264654), the levels of 5 and 6 were both used to cover all bases.
The engine used for this model at both levels is: "xgboost".
autoplot() of Boosted Trees model tune with 5 levels
autoplot(tune_bt_nfl_5) + theme_minimal()
best_bt_5 <- select_best(tune_bt_nfl_5, metric = "roc_auc")
best_bt_5
## # A tibble: 1 × 4
## mtry trees learn_rate .config
## <int> <int> <dbl> <chr>
## 1 1 100 0.00000316 Preprocessor1_Model011
The best boosted trees model with 5 levels was Model 11 with
mtry = 1, trees = 100, and
learn_rate = 3.162278e-06.
autoplot() of Boosted Trees model tune with 6 levels
autoplot(tune_bt_nfl_6) + theme_minimal()
best_bt_6 <- select_best(tune_bt_nfl_6, metric = "roc_auc")
best_bt_6
## # A tibble: 1 × 4
## mtry trees learn_rate .config
## <int> <int> <dbl> <chr>
## 1 1 100 0.00158 Preprocessor1_Model025
The best boosted trees model with 6 levels was Model 25 with
mtry = 1, trees = 100, and
learn_rate = 0.001584893.
The Elastic Net - Lasso model is mainly useful when multiple predictors are correlated, applying the Lasso regularization technique to improve the model's predictive power.
The engine used for this model is: "glmnet".
autoplot() of Elastic Net - Lasso model tune
autoplot(tune_res_nfl)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
best_en_nfl <- select_best(tune_res_nfl, metric = "roc_auc")
best_en_nfl
## # A tibble: 1 × 3
## penalty mixture .config
## <dbl> <dbl> <chr>
## 1 0.333 0.333 Preprocessor1_Model034
The best Elastic Net - Lasso model was Model 34 with
penalty = 0.3333333 and mixture =
0.3333333.
The Support Vector Machines (SVM) - Linear Kernel model is useful for linearly separable data, i.e., data that can be separated with a single line (or hyperplane), that also has a multitude of predictors.
The engine used for this model is: "kernlab".
autoplot() of SVM - Linear Kernel tune
autoplot(svm_nfl_res)
best_svm <- select_best(svm_nfl_res, metric = "roc_auc")
best_svm
## # A tibble: 1 × 2
## cost .config
## <dbl> <chr>
## 1 0.00521 Preprocessor1_Model01
The best SVM - Linear Kernel model was Model 1 with cost
= 0.005212106.
The Support Vector Machines (SVM) - Radial Kernel model expands on the SVM - Linear Kernel by allowing the SVM to produce complex, non-linear boundaries, which is useful for non-linear and complex relationships between predictors.
The engine used for this model is: "kernlab".
autoplot() of SVM - Radial Kernel tune
autoplot(svm_rbf_res)
best_rbf_svm <- select_best(svm_rbf_res, metric = "roc_auc")
best_rbf_svm
## # A tibble: 1 × 2
## cost .config
## <dbl> <chr>
## 1 0.108 Preprocessor1_Model03
The best SVM - Radial Kernel model was Model 3 with cost
= 0.1077689.
We can now analyze the performance metrics of each model and determine our best and worst models.
So close to the finish line!
Next, before we look further into any of the models used, let's rank the accuracy values of all our models.
# multi metric needed to find accuracy of each model's predictions
multi_metric <- metric_set(accuracy, sensitivity, specificity)
# accuracy of logistic regression model's predictions
log_acc <- augment(nfllog_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of LDA model's predictions
lda_acc <- augment(nfl_lda_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of QDA model's predictions
qda_acc <- augment(nfl_qda_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of knn model's predictions
knn_acc <- augment(nfl_knn_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of decision trees model's predictions
dt_acc <- augment(tree_final_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of elastic net model's predictions
en_acc <- augment(en_final_nfl, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of random forests at 5 levels model's predictions
rf_5_acc <- augment(final_rf_fit_5, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of random forests at 6 levels model's predictions
rf_6_acc <- augment(final_rf_fit_6, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of boosted trees at 5 levels model's predictions
bt_5_acc <- augment(final_bt_fit_5, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of boosted trees at 6 levels model's predictions
bt_6_acc <- augment(final_bt_fit_6, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of linear SVM model's predictions
lin_svm_acc <- augment(svm_final_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
# accuracy of radial SVM model's predictions
r_svm_acc <- augment(svm_rbf_final_fit, new_data = nfl_train) %>%
accuracy(truth = winner, estimate = .pred_class)
accuracies_acc <- c(log_acc$.estimate,
lda_acc$.estimate,
qda_acc$.estimate,
knn_acc$.estimate,
dt_acc$.estimate,
en_acc$.estimate,
rf_5_acc$.estimate,
rf_6_acc$.estimate,
bt_5_acc$.estimate,
bt_6_acc$.estimate,
lin_svm_acc$.estimate,
r_svm_acc$.estimate
)
models_acc <- c("Logistic Regression",
"LDA",
"QDA",
"KNN",
"Decision Trees",
"Elastic Net: Lasso",
"Random Forests: 5 Levels",
"Random Forests: 6 Levels",
"Boosted Trees: 5 Levels",
"Boosted Trees: 6 Levels",
"Linear SVM",
"Radial SVM"
)
results_acc <- tibble(accuracies_acc = accuracies_acc, models_acc = models_acc)
results_acc %>%
arrange(-accuracies_acc)
## # A tibble: 12 × 2
## accuracies_acc models_acc
## <dbl> <chr>
## 1 0.967 QDA
## 2 0.924 Random Forests: 6 Levels
## 3 0.897 Random Forests: 5 Levels
## 4 0.848 KNN
## 5 0.793 Decision Trees
## 6 0.761 Boosted Trees: 6 Levels
## 7 0.75 Boosted Trees: 5 Levels
## 8 0.745 LDA
## 9 0.734 Logistic Regression
## 10 0.609 Linear SVM
## 11 0.598 Elastic Net: Lasso
## 12 0.598 Radial SVM
The accuracy values of each model's predictions do give us
insight into how well our models performed. Although these values are
not as important for classification purposes as each model's
roc_auc value, they are extremely informative
for a surface-level interpretation of each model's performance.
Let's also get a visual of how each model's accuracy
compares to every other model in the format of a bar plot.
acc_bar <- ggplot(results_acc, aes(x = factor(models_acc, levels = models_acc[order(accuracies_acc, decreasing = TRUE)]), y = accuracies_acc, fill = models_acc)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(x = "Models", y = "Accuracy") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels to avoid overlap
acc_bar
This visual is a little crowded, so we will use a dot plot to get a
better view of the accuracy values.
acc_dot <- ggplot(results_acc, aes(x = factor(models_acc, levels = models_acc[order(accuracies_acc, decreasing = TRUE)]), y = accuracies_acc)) +
geom_point(size =10, color = "cyan") +
geom_segment(aes(x = models_acc,
xend = models_acc,
y=min(accuracies_acc),
yend = max(accuracies_acc)),
linetype = "dashed",
linewidth=0.5) +
labs(title = "Performance of Our Models", x = "Models",y="Accuracy Value") +
theme_minimal() +
coord_flip()
acc_dot
Thus, we can see both visually and numerically that:
Our model with the lowest accuracy value was a tie
between: Elastic Net - Lasso & Radial SVM
Our model with the median accuracy value was:
Boosted Trees with 5 Levels
Our model with the best accuracy value was:
Quadratic Discriminant Analysis
While the accuracy values of the predicted outcomes are
an interesting metric to look at, we move on to ranking our
roc_auc values which are much more important when it comes
to choosing the best model without threshold interference. Below is
every model’s roc_auc value ranked from best to worst
performance.
# roc_auc of logistic regression model's predictions
log_roc <- augment(nfllog_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of LDA model's predictions
lda_roc <- augment(nfl_lda_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of QDA model's predictions
qda_roc <- augment(nfl_qda_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of knn model's predictions
knn_roc <- augment(nfl_knn_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of decision trees model's predictions
dt_roc <- augment(tree_final_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of elastic net model's predictions
en_roc <- augment(en_final_nfl, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of random forests at 5 levels model's predictions
rf_5_roc <- augment(final_rf_fit_5, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of random forests at 6 levels model's predictions
rf_6_roc <- augment(final_rf_fit_6, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of boosted trees at 5 levels model's predictions
bt_5_roc <- augment(final_bt_fit_5, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of boosted trees at 6 levels model's predictions
bt_6_roc <- augment(final_bt_fit_6, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of linear SVM model's predictions
lin_svm_roc <- augment(svm_final_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
# roc_auc of radial SVM model's predictions
r_svm_roc <- augment(svm_rbf_final_fit, new_data = nfl_train) %>%
roc_auc(winner, .pred_H)
roc_auc <- c(log_roc$.estimate,
lda_roc$.estimate,
qda_roc$.estimate,
knn_roc$.estimate,
dt_roc$.estimate,
en_roc$.estimate,
rf_5_roc$.estimate,
rf_6_roc$.estimate,
bt_5_roc$.estimate,
bt_6_roc$.estimate,
lin_svm_roc$.estimate,
r_svm_roc$.estimate
)
models <- c("Logistic Regression",
"LDA",
"QDA",
"KNN",
"Decision Trees",
"Elastic Net: Lasso",
"Random Forests: 5 Levels",
"Random Forests: 6 Levels",
"Boosted Trees: 5 Levels",
"Boosted Trees: 6 Levels",
"Linear SVM",
"Radial SVM"
)
results_roc <- tibble(roc_auc = roc_auc, models = models)
results_roc %>%
arrange(-roc_auc)
## # A tibble: 12 × 2
## roc_auc models
## <dbl> <chr>
## 1 0.997 Random Forests: 6 Levels
## 2 0.996 QDA
## 3 0.995 Random Forests: 5 Levels
## 4 0.954 Boosted Trees: 6 Levels
## 5 0.951 KNN
## 6 0.948 Boosted Trees: 5 Levels
## 7 0.889 Radial SVM
## 8 0.874 Decision Trees
## 9 0.804 Logistic Regression
## 10 0.802 LDA
## 11 0.759 Linear SVM
## 12 0.665 Elastic Net: Lasso
Let's also get a visual of how each model's roc_auc
compares to every other model in the format of a bar plot.
roc_bar <- ggplot(results_roc, aes(x = factor(models, levels = models[order(roc_auc, decreasing = TRUE)]), y = roc_auc, fill = models)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(x = "Models", y = "ROC AUC") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels to avoid overlap
roc_bar
This visual is a little crowded, so we will use a dot plot to get a
better view of the roc_auc values.
roc_dot <- ggplot(results_roc, aes(x = factor(models, levels = models[order(roc_auc, decreasing = TRUE)]), y = roc_auc)) +
geom_point(size =10, color = "cyan") +
geom_segment(aes(x = models,
xend = models,
y=min(roc_auc),
yend = max(roc_auc)),
linetype = "dashed",
linewidth=0.5) +
labs(title = "Performance of Our Models", x = "Models",y="ROC_AUC Value") +
theme_minimal() +
coord_flip()
roc_dot
Thus, we can see both visually and numerically that:
Our model with the lowest roc_auc value was: Elastic
Net - Lasso
Our model with the median roc_auc value was: Boosted
Trees with 5 Levels
Our model with the best roc_auc value was: Random
Forests Model with 6 Levels
Let’s take a look at our model that performed the worst, in terms of
roc_auc & accuracy value, and analyze
graphically why the model performed so poorly.
We will produce the metrics, compute the best model found, print the confusion matrix of the number of correct and incorrect predicted values, and analyze the ROC curve graph of the Elastic Net - Lasso model.
From these, we will be able to comprehend why the Elastic Net - Lasso
model attained the lowest roc_auc and accuracy
values of all models using the training data.
Let's begin with the worst performing model, which had a roc_auc
value of 0.6647420 and an accuracy value of 0.5978261.
Metrics and Best Model
First, we will analyze the metrics of the Elastic Net - Lasso model.
multi_metric <- metric_set(accuracy, sensitivity, specificity)
augment(en_final_nfl, new_data = nfl_train) %>%
multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.598
## 2 sensitivity binary 1
## 3 specificity binary 0
We see from these metrics that the model predicted a home win in every
single game: sensitivity is 1 and specificity is 0, which is why the
accuracy of its predicted values sits at 59.78261%, roughly the
proportion of training games the home team won.
However, in comparison to the rest of the models the Elastic Net - Lasso model was dead last.
In terms of gap, its accuracy value was 0.3695652 lower
than the model with the best accuracy value and 0.1521739
lower than the model with the median accuracy value.
Next, let’s see the best model for Elastic Net - Lasso that was chosen.
best_en_nfl <- select_best(tune_res_nfl, metric = "roc_auc")
best_en_nfl
## # A tibble: 1 × 3
## penalty mixture .config
## <dbl> <dbl> <chr>
## 1 0.333 0.333 Preprocessor1_Model034
According to our roc_auc criteria, the model with
penalty = 0.3333333, mixture = 0.3333333, and
.config = Preprocessor1_Model034 performs the best for this
model. This is backed up by our autoplot() function in a
previous section.
Now, we produce the confusion matrix of the predicted values for the training data.
suppressMessages({
augment(en_final_nfl, new_data = nfl_train) %>%
conf_mat(truth = winner, estimate = .pred_class) %>%
autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})
We see that this model correctly predicted 110 games, all of them games in which the home team won; it did not correctly predict a single game in which the visiting team won.
We also see this model incorrectly predicted 74 games, each one a game where the visiting team won but a home win was predicted.
Thus, we understand that the poor performance of this model is most likely attributed to its inability to predict when the visiting team won.
Last, take a look at the ROC curve produced from our Elastic Net - Lasso model using the training data.
augment(en_final_nfl, new_data = nfl_train) %>%
roc_curve(winner, .pred_H) %>%
autoplot()
We conclude that the low roc_auc is confirmed by this
model's poor ROC curve. What makes the ROC curve poor is that it sits
closer to the diagonal midline than to the top left corner.
Let’s take a look at our top two models that performed the best, in
terms of roc_auc & accuracy value, and
analyze graphically why the models performed so well.
We will produce the metrics, compute the best model found, print the confusion matrix of the number of correct and incorrect predicted values, and analyze the ROC curve graph of the Quadratic Discriminant Analysis and Random Forests with 6 Levels models.
From these, we will be able to comprehend why the Quadratic
Discriminant Analysis and Random Forests with 6 Levels models attained
the two highest roc_auc values of all models using the
training data.
Let’s begin with the model with the best accuracy value
of 0.9673913 and the second best roc_auc value of
0.9964373.
First, we will analyze the metrics of the QDA model.
multi_metric <- metric_set(accuracy, sensitivity, specificity)
augment(nfl_qda_fit, new_data = nfl_train) %>%
multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.967
## 2 sensitivity binary 0.973
## 3 specificity binary 0.959
We see that this model did very well in terms of metrics with the
accuracy of predicted values at 96.73913%.
Now, we produce the confusion matrix of the predicted values for the training data.
suppressMessages({
augment(nfl_qda_fit, new_data = nfl_train) %>%
conf_mat(truth = winner, estimate = .pred_class) %>%
autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})
We see that this model correctly predicted 178 games. With 107 games correctly predicting the home team won. And 71 games correctly predicting the visiting team won.
We also see this model incorrectly predicted 6 games. With 3 games incorrectly predicting the home team won. And 3 games incorrectly predicting the visiting team won.
Last, take a look at the ROC curve produced from our QDA model using the training data.
augment(nfl_qda_fit, new_data = nfl_train) %>%
roc_curve(winner, .pred_H) %>%
autoplot()
We conclude that the high roc_auc is confirmed by the
great ROC curve of this model. The characteristic that makes this ROC
curve great is the fact that the ROC curve is very near the top left
corner of the graph itself.
Let’s finish with the model with the best roc_auc value
of 0.9974201 and the second best accuracy value of
0.9239130.
First, we will analyze the metrics of the Random Forests with 6 Levels model.
multi_metric <- metric_set(accuracy, sensitivity, specificity)
augment(final_rf_fit_6, new_data = nfl_train) %>%
multi_metric(truth = winner, estimate = .pred_class)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.924
## 2 sensitivity binary 1
## 3 specificity binary 0.811
We see that this model did very well in terms of metrics with the
accuracy of predicted values at 92.39130%.
Next, let's see the best tune that was chosen for the Random Forests with 6 Levels model.
best_rf_6 <- select_best(
tune_rf_6, metric = "roc_auc"
)
best_rf_6
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 1 100 10 Preprocessor1_Model001
According to our roc_auc criteria, the model with
mtry = 1, trees = 100, and min_n
= 10 performs the best for this model. This is backed up by our
autoplot() function in a previous section.
Now, we produce the confusion matrix of the predicted values for the training data.
suppressMessages({
augment(final_rf_fit_6, new_data = nfl_train) %>%
conf_mat(truth = winner, estimate = .pred_class) %>%
autoplot(type = "heatmap") + scale_fill_gradient(low = "red", high = "green")
})
We see that this model correctly predicted 170 games. With 110 games correctly predicting the home team won. And 60 games correctly predicting the visiting team won.
We also see this model incorrectly predicted 14 games. With 14 games incorrectly predicting the home team won. And 0 games incorrectly predicting the visiting team won.
Last, take a look at the ROC curve produced from our Random Forests Model with 6 Levels model using the training data.
augment(final_rf_fit_6, new_data = nfl_train) %>%
roc_curve(winner, .pred_H) %>%
autoplot()
We conclude that the high roc_auc is again confirmed by the ROC curve: while not dramatically better, it sits even closer to the top-left corner of the graph than the QDA model's curve.
Now that we have determined our two best models, Quadratic Discriminant Analysis (highest accuracy value) and the Random Forests Model with 6 Levels (highest roc_auc value), let's fit both models separately to the test data and review our results.
This is pretty much our machine learning Super Bowl!
First, we check how accurately the QDA model predicted the winner of each NFL game in the testing data set.
multi_metric <- metric_set(accuracy, sensitivity, specificity)
qda.acc<-augment(nfl_qda_fit, new_data = nfl_test) %>%
accuracy(truth = winner, estimate = .pred_class)
qda.acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.625
Before we check the roc_auc for this model with the
testing data set, let’s look at the ROC curve first.
augment(nfl_qda_fit, new_data = nfl_test) %>%
roc_curve(winner, .pred_H) %>%
autoplot()
Now we check how well the model performed with the testing data in
terms of roc_auc value:
qda.roc<-augment(nfl_qda_fit, new_data = nfl_test) %>%
roc_auc(winner, .pred_H)
qda.roc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.618
Next, we check how accurately the Random Forests Model with 6 Levels predicted the winner of each NFL game in the testing data set.
multi_metric <- metric_set(accuracy, sensitivity, specificity)
rf6.acc<-augment(final_rf_fit_6, new_data = nfl_test) %>%
accuracy(truth = winner, estimate = .pred_class)
rf6.acc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.7
Before we check the roc_auc for this model with the
testing data set, let’s look at the ROC curve first.
augment(final_rf_fit_6, new_data = nfl_test) %>%
roc_curve(winner, .pred_H) %>%
autoplot()
Now we check how well the model performed with the testing data in
terms of roc_auc value:
rf6.roc<-augment(final_rf_fit_6, new_data = nfl_test) %>%
roc_auc(winner, .pred_H)
rf6.roc
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.736
Let’s see which model had the better accuracy value
using the testing data set.
best_mod <- c("QDA","Random Forests Model with 6 Levels")
acc_fin <- c(qda.acc$.estimate,rf6.acc$.estimate)
best_acc <- tibble(accuracy = acc_fin, models = best_mod)
best_acc %>%
arrange(desc(accuracy))
## # A tibble: 2 × 2
## accuracy models
## <dbl> <chr>
## 1 0.7 Random Forests Model with 6 Levels
## 2 0.625 QDA
Thus, we see that the Random Forests Model with 6 Levels performed better in terms of the accuracy performance metric.
Now let's see how the two match up when it comes to roc_auc values:
best_mod <- c("QDA","Random Forests Model with 6 Levels")
roc_fin <- c(qda.roc$.estimate,rf6.roc$.estimate)
best_roc <- tibble(roc_auc = roc_fin, models = best_mod)
best_roc %>%
arrange(desc(roc_auc))
## # A tibble: 2 × 2
## roc_auc models
## <dbl> <chr>
## 1 0.736 Random Forests Model with 6 Levels
## 2 0.618 QDA
Once again, the Random Forests Model with 6 Levels performed better.
Thus, we have determined that our best model, in terms of both accuracy and roc_auc, is the Random Forests Model with 6 Levels.
Although our best model only achieved an accuracy of 70% and a roc_auc of 0.7356771 on the test data, this is far and away a victory in my eyes.
Before we conclude, let's take a peek at the variable importance plot to understand which predictors played the largest part in influencing this model's results.
library(vip)
## Warning: package 'vip' was built under R version 4.3.2
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
final_rf_fit_6 %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "cyan", color = "black")) +
theme_minimal()
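If the raw importance scores are wanted as numbers rather than a plot, the same fitted object can be passed to `vip::vi()`, which returns the scores as a tibble:

```r
# Tabular variable importance from the same fitted random forest;
# vi() returns a tibble with Variable and Importance columns.
final_rf_fit_6 %>%
  extract_fit_parsnip() %>%
  vi() %>%
  head(10)
```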
I guess Vegas really does have the greatest influence, or at least in our model it does!
When thinking about potential improvements or adjustments to this model, I believe a better boosted trees or random forests model could be built on a machine with more processing power. Because we were not able to tune these models over the full set of predictors, and instead had to settle for the square root of that number, I believe we missed hyperparameter combinations that could have produced even better roc_auc and accuracy values. Unfortunately, I attempted to install a package that would use all 10 cores of my laptop, rather than the single core R uses by default, but that package had never been updated for the newest version of R, so I was not able to use it. The last improvement I would recommend is adding more predictors. Even though we had a decent number of predictors for this model, I believe we missed some, such as weather and player/team-specific statistics, that could lead to a model that performs better on testing and unseen data.
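For what it's worth, one commonly used route to multicore tuning with tidymodels is registering a parallel backend via the `doParallel` package, which `tune_grid()` picks up automatically. A sketch, assuming `doParallel` installs cleanly on the R version in use:

```r
# Hedged sketch: register a parallel backend so tune_grid() can
# evaluate grid candidates across multiple cores.
library(doParallel)

cl <- makePSOCKcluster(parallel::detectCores() - 1)  # leave one core free
registerDoParallel(cl)

# ... run tune_grid() here as before ...

stopCluster(cl)  # release the worker processes when tuning finishes
```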
In an NFL game, or any sports game for that matter, predicting the outcome is very difficult given the number of variables that simply cannot be accounted for or encoded as categorical and numeric values. In my opinion, even though our accuracy and roc_auc values may seem low for a machine learning project, the fact that we were able to produce values above 60% (0.6) is very impressive.
Throughout this entire project I faced many challenges and found myself stuck for hours and even days trying to fix problems and build models that actually worked. While this may have been frustrating in the moment, I am immensely proud of the results that I was able to produce with this project and will effectively use this Random Forests Model with 6 Levels for my own personal enjoyment (and possibly betting wagers for fun).